ABSTRACT

Most of the work on query evaluation in probabilistic databases has
focused on the simple tuple-independent data model, where tuples are
independent random events. Several efficient query evaluation
techniques exist in this setting, such as safe plans, algorithms based
on OBDDs, tree decompositions, and a variety of approximation
algorithms. However, complex data analytics tasks often require
complex correlations, and query evaluation then is significantly more
expensive, or more restrictive.

In this paper, we propose MVDB as a framework both for representing
complex correlations and for efficient query evaluation. An MVDB
specifies correlations by defining views, called MarkoViews, on the
probabilistic relations and declaring the weights of the views'
outputs. An MVDB is a (very large) Markov Logic Network. We make two
sets of contributions. First, we show that query evaluation on an
MVDB is equivalent to evaluating a Union of Conjunctive Queries (UCQ)
over a tuple-independent database. The translation is exact (thus
allowing the techniques developed for tuple-independent databases to
be carried over to MVDBs), yet it is novel and quite non-obvious (some
resulting probabilities may be negative!).

This translation in itself, though, may not lead to much gain, since
the translated query gets complicated as we try to capture more
correlations. Our second contribution is to propose a new query
evaluation strategy that exploits offline compilation to speed up
online query evaluation. Here we utilize and extend our prior work on
compilation of UCQs. We experimentally validate our techniques on a
large probabilistic database with MarkoViews inferred from the DBLP
data.

1. INTRODUCTION

The task of analyzing and extracting knowledge from large datasets
often requires probabilistic inference over a complex probabilistic
model on the data. This step represents a major challenge. Most of the
scalable query processing techniques developed for probabilistic
databases assume that the tuples are independent events, or
disjoint-independent [4, 1, 24]. For example, MystiQ, MayBMS, and
SPROUT report running times of a few seconds on databases of tens of
millions of tuples, using a combination of techniques such as safe
plans [7], plan re-orderings and functional dependencies [23],
Monte-Carlo simulation combined with top-k optimization [28], or
approximate confidence computation that trades off precision for
performance [24]. These systems scale to quite large databases, but
are limited to independent, or disjoint-independent, probabilistic
databases.

Tuple-independent probabilistic databases are insufficient 
for analyzing and extracting knowledge from practical data- 
sets. As has been shown in the Machine Learning commu- 
nity, modeling correlations is critical in complex knowledge 
extraction tasks. For example, in Markov Logic Networks 
(MLN) [29], users can assert arbitrary probabilistic state- 
ments over the data, in the form of First Order Logic sen- 
tences, and assign a weight. The sentence, called a feature, is 
expected to hold to a degree indicated by the weight. Each 
feature may introduce correlations between a large number 
of base facts, and thus the MLN can express, very concisely, 
a large Markov Network. MLNs have been demonstrated to be effective at
a variety of tasks, such as Information Extraction [26], Record
Linkage [31], and Natural Language Processing [27]. A benefit of MLNs
is that the same framework can be used both for learning the weights
and for inferring probabilities of new queries.

In this paper we present a new approach for representing and querying
probabilistic databases. Our data model combines probabilistic
databases with MLNs: it consists of a collection of probabilistic
tables (whose tuples are annotated with a probability) and
deterministic tables, and a collection of views, called MarkoViews. A
MarkoView is expressed by a Union of Conjunctive Queries (UCQ) over
the probabilistic and deterministic tables, and associates a weight to
each tuple in the answer; intuitively, it asserts a likelihood for
that output tuple, and therefore introduces a correlation between all
contributing input tuples. A MarkoView can be seen as a set of MLN
features, and thus its weights can be learned as in MLNs; we do not
address learning in this paper, but focus solely on inference, or
query evaluation. We call a database consisting of probabilistic
tables and MarkoViews an MVDB. The data model of MVDBs is
significantly richer than that of tuple-independent probabilistic
databases, which we denote with INDB.

We make two sets of contributions. First, we show how to translate
query evaluation over an MVDB into query evaluation over an INDB. More
precisely, we express the probability $P(Q)$ on an MVDB in terms of
the probability $P_0(Q \lor W)$, where $W$ is a union of queries, one
for each MarkoView; therefore $W$ is a UCQ. To be precise, we show
that $P(Q) = P_0(Q \mid \lnot W) = (P_0(Q \lor W) - P_0(W))/(1 - P_0(W))$.
The probability $P_0$ is on an INDB obtained from the MVDB through a
simple, yet quite non-obvious transformation, discussed in Sect. 3.
Without going into the transformation details, we would like to
mention that $\lnot W$ logically stands for the MarkoViews, hence
intuitively the translation is computing the probability of the query
given that the views hold. Note that if $Q$ is a UCQ, so is the
translated query; hence, both the query and the probabilistic model
are simple and come from well-understood domains. In particular, while
there are very few results on the tractability of MLNs, and none
complete, the set of tractable UCQs over INDBs is already known [8].
Therefore, the translation moves our problem into a domain that is
well understood and allows for easier detection of tractable
instances. We are aware of one more such translation [12] in the
literature, but it leads to a more complicated query, which is no
longer a UCQ. In contrast, our translation leads to both a simple
query (UCQ) and a simple model (tuple-independent), where the
complexity is well understood and the tractable cases are fully
characterized.

Our second set of contributions is to devise an efficient query
evaluation method for MVDBs. The task is to compute $P_0(Q \lor W)$,
where both $Q$ and $W$ are UCQs. Note that $W$ depends only on the
MarkoViews, and does not depend on the query $Q$. We describe a new
index structure for $W$, called an MV-index, and show how to use it to
compute $P_0(Q \lor W)$. The MV-index consists of an Ordered Binary
Decision Diagram (OBDD) [5], extended with additional information that
is critical for computing $P_0(Q \lor W)$ efficiently. In prior work
[15] we have shown that a certain class of UCQ queries, called
inversion-free queries, always have an OBDD whose size is linear in
the size of the active domain of the database. Here, we extend that
construction to an algorithm that constructs an OBDD for any $W$: in
the particular case when $W$ is inversion-free, the resulting OBDD is
guaranteed to be linear in the size of the database. The OBDD itself
is not sufficient for computing $P_0(Q \lor W)$ efficiently:
state-of-the-art synthesis algorithms require time proportional to the
product of the sizes of the two OBDDs for $Q$ and $W$ respectively. We
describe here how to extend the OBDD to an MV-index, then present a
top-down evaluation algorithm for computing $P_0(Q \lor W)$, which
traverses only a small fraction of the MV-index.

Running Example. We illustrate in Fig. 1 how MarkoViews can be used to
add domain knowledge to DBLP [19], a database consisting of a few
million tuples. We use an approach developed in the AI community for
inferring new relations, such as advisor or affiliation, from a
database of citations [29, 30, 18]. Any MVDB has three parts.

First, the deterministic database. In our example it is described at
the top of Fig. 1, and consists of four base tables, Author, Wrote,
Pub, HomePage, and two materialized views (FirstPub(aid, year), which
records for each author the year of her first publication, and
DBLPAffiliation(aid, inst), which associates an affiliation to some
authors; see footnote 1).

1 We computed the institute from the person's Webpage, 
when it was available in DBLP. For example, both Luis 



Second, an MVDB has probabilistic tables, shown in the middle of
Fig. 1: Student^p(aid, year) stores likely years when an author was a
student, Advisor^p(aid1, aid2) stores likely advisor/advisee
relationships, and Affiliation^p(aid, inst) records inferred
affiliations based on co-authorship. Each probabilistic table is
defined by a query, which also associates a weight to every output
tuple; for example,
Student^p(aid, year)[exp(1 - .15*(year - year'))] :- associates the
weight $w = e^{1 - 0.15(year - year')}$ to its output. Weights are
often preferred over probabilities when the probability function is a
product of potential functions, as in MLNs and MVDBs. The intuition is
that the weight $w$ represents the odds of a probability $p$,
$w = p/(1 - p)$ (formal definition in Sect. 2).
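
As a rough illustration of how such a weighted table could be materialized, the following Python sketch generates the possible tuples of Student^p from a toy FirstPub table, using the weight expression and the year range given in Fig. 1. The function names and the dictionary representation are assumptions made for this example only; they are not part of the system described here.

```python
import math

def student_possible_tuples(first_pub):
    """Possible tuples of Student^p with their weights.

    first_pub maps an author id to year', the year of the first
    publication. Following Fig. 1, a tuple (aid, year) is possible when
    year' - 1 <= year <= year' + 5, with weight exp(1 - .15*(year - year')).
    """
    possible = {}
    for aid, first_year in first_pub.items():
        for year in range(first_year - 1, first_year + 6):
            possible[(aid, year)] = math.exp(1 - 0.15 * (year - first_year))
    return possible

def odds_to_probability(w):
    """Interpret a weight as odds: p = w / (1 + w)."""
    return w / (1 + w)

# Hypothetical author "a1" whose first publication appeared in 2000.
tuples = student_possible_tuples({"a1": 2000})
```

Under these assumptions each author contributes seven possible tuples, and the weight decreases as the year moves away from the first publication.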

Third, the MVDB contains a set of MarkoViews, which in our example are
shown at the bottom of Fig. 1. Each MarkoView is a query over
probabilistic tables, and its purpose is to define some correlations
between the tuples in those tables. It does this by defining a view
over the probabilistic tables, then asserting a certain weight for the
tuples in the view. Weights < 1 define a negative correlation, weights
> 1 define a positive correlation, and a weight = 1 means
independence. A weight = 0 means a hard constraint: the view must be
empty. For example, the MarkoView V1 defines a correlation between a
tuple in Student^p and a tuple in Advisor^p: it states that the more
papers two people co-author during the years when the second person
was a student, the more likely that the first person was his advisor.
V2 defines a hard constraint: each person can have only one advisor.
Finally, V3 introduces positive correlations between common
affiliations for people who published a lot together.

Consider now the following simple query on the MVDB: find all students
advised by Sam Madden. The query, written over the MVDB, is shown in
Fig. 2 (a). If the tuples in Student^p and Advisor^p were independent
random variables, then this query could be computed very efficiently,
because it is a safe query [7]. However, MarkoViews introduce
correlations between the probabilistic tuples. We show in this paper
that the probability of an answer aid, $P(Q(aid))$, can be expressed
in terms of the probability of a query $P_0(Q(aid) \lor W)$ over a
tuple-independent database. We give the exact formula in Fig. 2 (c).
The new INDB has five tuple-independent probabilistic tables:
Student^p and Advisor^p (which have the same sets of possible tuples
as in the MVDB, and with the same weights), and NV1^p, NV2^p, NV3^p,
which are three new, tuple-independent probabilistic tables, whose
possible tuples are obtained from the MarkoViews V1, V2, V3, and whose
weights are derived from the latter through the formula $(1 - w)/w$.
Note that if $w > 1$, then the translated weight is negative, which,
in turn, corresponds to a negative probability; thus, the INDB may
have some tuples with negative probabilities! However, the expression
for $P(Q)$ is exact, hence its value is guaranteed to be in [0, 1]. We
discuss the translation from MarkoViews to INDBs in Sect. 3, and, in
particular, the implications of having negative probabilities in a
database in Sect. 3.3.
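
The effect of the formula $(1 - w)/w$ can be checked on a few values. The sketch below uses exact rational arithmetic; the helper names are hypothetical. It converts the translated weight $w_0$ to a probability through the odds relation $p = w_0/(1 + w_0)$, which simplifies to $1 - w$, so a view weight above 1 yields a negative probability. The hard-constraint case $w = 0$ (infinite translated weight) is deliberately left out.

```python
from fractions import Fraction

def nv_weight(w):
    """Weight of an NV tuple for a MarkoView tuple of weight w > 0."""
    w = Fraction(w)
    return (1 - w) / w

def nv_probability(w):
    """Probability matching the NV weight: w0 / (1 + w0), which equals 1 - w."""
    w0 = nv_weight(w)
    return w0 / (1 + w0)

# View weights below, at, and above 1 (illustrative values only).
examples = {w: (nv_weight(w), nv_probability(w))
            for w in (Fraction(1, 2), Fraction(1), Fraction(4))}
```

With these inputs, a weight of 1 (independence) gives an NV tuple of weight 0, which can never be present, while a weight of 4 gives a negative NV weight and a negative NV probability.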

Query evaluation on an MVDB reduces to evaluating formulas like the
one in Fig. 2 (c) on an INDB. The query-dependent part of this
expression is $P_0(Q(aid) \lor W)$. While this requires standard
evaluation of a query over a tuple-independent data-

Gravano (www.cs.columbia.edu/~gravano) and Ken Ross
(www.cs.columbia.edu/~kar) have the same institute,
www.cs.columbia.edu.
Tables obtained from DBLP

  Table                     # Tuples
  Author(aid, name)         1M
  Wrote(aid, pid)           4.5M
  Pub(pid, title, year)     1.7M
  HomePage(aid, url)        18.7K

Derived tables (standard views)

  Table                         # Tuples
  FirstPub(aid, year)           1M
  DBLPAffiliation(aid, inst)    18.7K

Three probabilistic tables Student^p, Advisor^p, Affiliation^p

  Possible tuples:
      Student^p(aid, year)[exp(1 - .15*(year - year'))] :-
          FirstPub(aid, year'), year' - 1 <= year <= year' + 5
  Description: aid was a student in that year if his first publication
      was not long before.
  Size: 6M

  Possible tuples:
      Advisor^p(aid1, aid2)[exp(.25*count(pid))] :-
          Student(aid1, year), Wrote(aid1, pid), Wrote(aid2, pid),
          Pub(pid, title, year), not Student(aid2, year),
          count(pid) > 2
  Description: aid2 was aid1's advisor if they published enough papers
      together while aid1 was a student and aid2 was not.
  Size: .25M

  Possible tuples:
      Affiliation^p(aid, inst)[exp(.1*count(pid))] :-
          Wrote(aid, pid), Wrote(aid2, pid),
          DBLPAffiliation(aid2, inst), Pub(pid, title, year),
          aid <> aid2, year > 2005, not DBLPAffiliation(aid, inst2)
  Description: aid's affiliation is inst if she published recently with
      people from inst.
  Size: .27M

The MarkoViews in the MVDB

  View definition:
      V1(aid1, aid2)[count(pid)/2] :- Advisor^p(aid1, aid2),
          Student^p(aid1, year), Wrote(aid1, pid),
          Wrote(aid2, pid), Pub(pid, title, year)
  Description: The more they published together while aid2 was a
      student, the more likely aid1 was his advisor.
  Size: .25M

  View definition:
      V2(aid1, aid2, aid3)[0] :- Advisor^p(aid1, aid2),
          Advisor^p(aid1, aid3), aid2 <> aid3
  Description: A person has only one advisor.
  Size: .38M

  View definition:
      V3(aid1, aid2, inst)[count(pid)/5] :-
          Affiliation^p(aid1, inst), Affiliation^p(aid2, inst),
          Wrote(aid1, pid), Wrote(aid2, pid), Pub(pid, title, year),
          year > 2004, count(pid) > 30
  Description: If two people have published a lot together recently,
      then their affiliations are very likely to be the same.
  Size: 1.5K

Figure 1: An illustration of MarkoViews on the DBLP database 



base, it can still be a major challenge: the reason is that the
lineage of $W$ is usually large, because it includes most
probabilistic tuples and all tuples in the MarkoViews. By contrast,
the lineage of $Q$ is much smaller, because it has a selection
predicate ('%Madden%'). This is why we use the strategy of compiling
$W$ offline. More precisely, we construct an MV-index, by first
converting $W$ into an OBDD, then adding appropriate pointer
structures. Using the MV-index, query evaluation can be sped up
dramatically: $Q$ took 15 ms and returned results for all 48 advisors
similarly named. We describe query compilation in Sect. 4, and
describe experiments in Sect. 5.

2. DEFINITIONS 

We use R to denote a relational schema with relation names
$R_1, R_2, \ldots, R_k$. We assume each relation has a key (footnote
2). A database instance $I$ is a $k$-tuple $(R_1^I, \ldots, R_k^I)$,
where $R_i^I$ is an instance of the relation $R_i$; with some abuse of
notation we drop the superscript $I$ and write $R_i$ for both the
relation name and the instance of that relation.

2.1 Probabilistic Databases 

A probabilistic database is a pair $D = (\mathbf{W}, P)$, where
$\mathbf{W} = \{I_1, \ldots, I_N\}$ is a set of instances, called
possible worlds, and $P : \mathbf{W} \to [0, 1]$ is a function such
that $\sum_{j=1}^{N} P(I_j) = 1$. Thus, the instance is not known with
certainty: every

2 As usual, if there is no natural key, then the set of all 
attributes constitutes a key. 



possible world $I_j$ has some probability, $P(I_j)$. A relation $R_i$
is called deterministic if it has the same instance in all possible
worlds, $R_i^{I_1} = \cdots = R_i^{I_N}$; otherwise, we say that $R_i$
is probabilistic, and we sometimes add the superscript $p$, writing
$R_i^p$ to indicate that $R_i$ is probabilistic. For example in
Fig. 1, the deterministic relations are Author, Wrote, Pub, HomePage,
and the probabilistic relations are Student^p, Advisor^p,
Affiliation^p.

We denote by Tup the set of possible tuples, i.e. the set of all
tuples occurring in all possible worlds $I_1, \ldots, I_N$. The tuples
in Tup include the relation name where they come from, e.g. the tuples
$R(a, b)$ and $S(a, b)$ are considered distinct tuples in Tup. We
associate to each tuple $t \in Tup$ a Boolean random variable, denoted
$X_t$: given a random world $I_j$, $X_t = 0$ if $t \notin I_j$ and
$X_t = 1$ if $t \in I_j$. The probability of the event $t \in I_j$ is
denoted $P(t)$ or $P(X_t)$, and is also called the marginal
probability of the tuple $t$.

A query is denoted $Q(x)$, where $x$ are called free variables, or
head variables. The answer to $Q$ on an instance $I$, $Q(I)$, is the
set of all tuples $a$ s.t. $I \models Q(a)$, where $Q(a)$ is the
Boolean query obtained by substituting the head variables $x$ with the
constants $a$. The answer on a probabilistic database $D$ is a set of
pairs of the form $(a, p)$, where $a \in Q(I)$ for some possible world
$I$, and:

$$p = P(Q(a)) = \sum_{j \,:\, a \in Q(I_j)} P(I_j)$$

Q(aid) :- Student^p(aid), Advisor^p(aid, aid1),
          Author^p(aid, n), Author^p(aid1, n1),
          n1 like '%Madden%'

(a)

W1 :- NV1^p(aid1, aid2), Advisor^p(aid1, aid2),
      Student^p(aid1, year), Wrote(aid1, pid),
      Wrote(aid2, pid), Pub(pid, title, year)

W2 :- NV2^p(aid1, aid2, aid3), Advisor^p(aid1, aid2),
      Advisor^p(aid1, aid3), aid2 <> aid3

W3 :- NV3^p(aid1, aid2, inst), Affiliation^p(aid1, inst),
      Affiliation^p(aid2, inst), Wrote(aid1, pid),
      Wrote(aid2, pid), Pub(pid, title, year),
      year > 2004, count(pid) > 30

W :- W1 ∨ W2 ∨ W3

(b)

$$P(Q(aid)) = \frac{P_0(Q(aid) \lor W) - P_0(W)}{1 - P_0(W)}$$

(c)

Figure 2: A query Q over the MVDB (a); helper queries on the INDB (b);
expressing the probability on the MVDB in terms of the probability on
the INDB (c).

The queries we consider in this paper are Unions of Conjunctive
Queries, denoted UCQ, which are expressions of the form
$Q(x) = Q_1(x) \lor \ldots \lor Q_m(x)$, where each $Q_i(x)$ is a
conjunctive query, i.e. has the form $\exists y_i.\varphi_i(x, y_i)$
where $\varphi_i$ is a conjunction of positive, relational atoms,
and/or inequality predicates, such as $z < 5$. We write queries in
datalog notation, indicating the head variables. For example,
$Q(x) = R(x), S(x, y)$ denotes the query
$\exists y.R(x) \land S(x, y)$, while $Q = R(x), S(x, y)$ denotes the
Boolean query $\exists x.\exists y.R(x) \land S(x, y)$. With some
abuse of notation, we allow the use of aggregates and negations, but
only on the deterministic tables (footnote 3).

2.2 Tuple-Independent Databases 

A probabilistic database is tuple-independent if, for any set of
possible tuples $t_1, t_2, \ldots, t_n$, the events
$X_{t_1}, X_{t_2}, \ldots, X_{t_n}$ are independent. We write $D_0$
for a tuple-independent database, and also denote it with INDB. It is
uniquely defined by a pair $(Tup, p)$, where Tup is the set of
possible tuples and $p : Tup \to [0, 1]$ is any function. The possible
worlds are all subsets $I \subseteq Tup$, and their probabilities are

$$P(I) = \prod_{t \in I} p(t) \cdot \prod_{t \in Tup - I} (1 - p(t)).$$

Tuple-independent probabilistic databases are the sim- 
plest, and most intensively studied types of probabilistic 
databases [33]. Even though the input tuples are indepen- 
dent, correlations are introduced during query evaluation, 
and query evaluation is, in general, #P-complete, e.g. for 

3 For example we used count(pid) in V3 in Fig. 1, meaning that the
subquery consisting of the last two lines is first computed as a view
over the deterministic tables, then the resulting view is used as a
single table in V3; after this transformation, V3 becomes a
conjunctive query over probabilistic and deterministic tables.



the Boolean query Q = R(x), S(x,y),T(y). However, these 
probabilistic databases are now well understood, and a com- 
plete characterization of UCQ queries into #P-complete and 
PTIME queries exists [8]. In addition, many practical meth- 
ods for query evaluation on tuple-independent databases 
have been proposed in the literature (see Sect. 6). 
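
To make the possible-worlds semantics concrete, the sketch below enumerates all worlds of a tiny tuple-independent database and sums the probabilities of those satisfying the Boolean query $Q = R(x), S(x, y), T(y)$. This is a brute-force illustration with invented tuple probabilities, exponential in the number of tuples; it is not one of the evaluation methods discussed in this paper.

```python
from itertools import combinations

def worlds(tuples):
    """Yield every subset of the possible tuples as a frozenset."""
    items = list(tuples)
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset)

def world_probability(world, prob):
    """P(I) = prod_{t in I} p(t) * prod_{t not in I} (1 - p(t))."""
    result = 1.0
    for t, p in prob.items():
        result *= p if t in world else 1.0 - p
    return result

def q_holds(world):
    """Boolean query Q = R(x), S(x, y), T(y)."""
    return any(
        t[0] == "S" and ("R", t[1]) in world and ("T", t[2]) in world
        for t in world
    )

def query_probability(prob, holds):
    """Sum the probabilities of all worlds in which the query holds."""
    return sum(world_probability(w, prob) for w in worlds(prob) if holds(w))

# Illustrative INDB: tuples carry their relation name as first component.
prob = {("R", "a"): 0.5, ("S", "a", "b"): 0.5, ("T", "b"): 0.5}
p_q = query_probability(prob, q_holds)
```

In this toy instance the query holds only in the world containing all three tuples, so its probability is the product of the three tuple probabilities.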

2.3 Markov Logic Networks (MLNs) 

A Markov Logic Network is a set
$L = \{(F_1, w_1), \ldots, (F_m, w_m)\}$, where each $F_i$ is a
formula in First Order Logic called a feature, and each $w_i$ is a
number called the weight [29, 9]. The formulas are over a relational
vocabulary R, and may have free variables, which are interpreted as
being universally quantified. Let $C$ be a finite set of constants. A
grounding of a formula $F_i$ is a formula of the form $F_i[a/x]$,
where the free variables $x$ of $F_i$ are substituted with some
constants $a$ in $C$; let $G(F_i)$ denote the set of groundings of
$F_i$, and let
$G(L) = \{(G, w_i) \mid \exists (F_i, w_i) \in L : G \in G(F_i)\}$ be
the grounded MLN. Let Tup be the set of ground tuples with the
relation symbols in R and constants in $C$. The weight $\Phi(I)$ of a
world $I \subseteq Tup$, and the partition function $Z$ are:

$$\Phi(I) = \prod_{(G, w) \in G(L) \,:\, I \models G} w \tag{1}$$

$$Z = \sum_{I \subseteq Tup} \Phi(I) \tag{2}$$

Definition 1. The semantics of an MLN $L$ is the probabilistic
database $D_L = (\mathbf{W}, P)$, where
$\mathbf{W} = \{I \mid I \subseteq Tup\}$ and $P(I) = \Phi(I)/Z$ for
all $I \subseteq Tup$.

The intuition is the following. Any subset of tuples is a possible
world, and its weight is the product of the weights of all grounded
features that are true in that world. The probability is obtained by
normalizing with $Z$. Note that a feature weight $w > 1$ means that
worlds where the feature holds are more likely; $w < 1$ means that
worlds where the feature holds are less likely; and $w = 1$ means
indifference. A weight $w = \infty$ is interpreted as a hard
constraint: only worlds that satisfy the feature are considered as
possible. This can be seen by letting $w \to \infty$ in the expression
$P(I) = \Phi(I)/Z$.

MLNs have been used in several applications of Machine 
Learning [31, 26, 27, 9]. One reason for their popularity 
is that they use the same formalism for both learning (of 
the weights w) and for probabilistic inference. There are 
two types of inferences within MLNs: MAP (maximum a 
posteriori) inference, which computes the most likely world, 
and marginal inference, which sums the probabilities of all 
worlds satisfying the query. In this paper we only address 
the latter, but our solutions easily generalize to solve the 
MAP inference problem as well. 
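
The semantics of Def. 1 and marginal inference can be sketched by brute force over all worlds. The code below is a minimal illustration, assuming that each grounded feature is given as a Python predicate on worlds together with its weight; this representation, the function names, and the example weights are choices made for the example and are not taken from any MLN system.

```python
from itertools import combinations

def all_worlds(tup):
    """Yield every subset of the ground tuples as a frozenset."""
    items = sorted(tup)
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset)

def world_weight(world, grounded_features):
    """Phi(I): product of the weights of the grounded features true in I."""
    phi = 1.0
    for holds, weight in grounded_features:
        if holds(world):
            phi *= weight
    return phi

def mln_distribution(tup, grounded_features):
    """P(I) = Phi(I) / Z for every world I."""
    weights = {w: world_weight(w, grounded_features) for w in all_worlds(tup)}
    z = sum(weights.values())
    return {w: phi / z for w, phi in weights.items()}

def marginal(dist, query):
    """Marginal inference: total probability of the worlds satisfying query."""
    return sum(p for w, p in dist.items() if query(w))

# Two ground tuples, each with its own single-atom feature (weights 3 and 1).
features = [(lambda w: "R(a1)" in w, 3.0), (lambda w: "R(a2)" in w, 1.0)]
dist = mln_distribution({"R(a1)", "R(a2)"}, features)
```

With single-atom features only, the marginals come out as $w/(1 + w)$ for each tuple, here 3/4 and 1/2.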

Tuple-Independent Databases Revisited. Consider two possible tuples,
$R(a_1), R(a_2)$, and the MLN consisting of the features
$(R(a_1), w_1)$, $(R(a_2), w_2)$. There are four possible worlds,
$\emptyset$, $\{R(a_1)\}$, $\{R(a_2)\}$, $\{R(a_1), R(a_2)\}$, and
their weights are:

$$1 \qquad w_1 \qquad w_2 \qquad w_1 w_2$$

The partition function is
$Z = 1 + w_1 + w_2 + w_1 w_2 = (1 + w_1)(1 + w_2)$. In this case the
MLN defines a tuple-independent database, where the tuples
$R(a_1), R(a_2)$ have probabilities $p_1 = w_1/(1 + w_1)$ and
$p_2 = w_2/(1 + w_2)$. Indeed, the four possible worlds have
probabilities

$$(1 - p_1)(1 - p_2) \qquad p_1 (1 - p_2) \qquad (1 - p_1) p_2 \qquad p_1 p_2$$

and this distribution is equivalent to the above (up to the
multiplicative factor $Z$). More generally, we define:

Definition 2. A tuple-independent database, INDB, is a pair
$D_0 = (Tup, w_0)$ where Tup is a set of possible tuples and $w_0(t)$
associates a real number to each tuple $t$.

This definition is equivalent to the one given earlier, by setting the
tuple probability to $p(t) = w_0(t)/(1 + w_0(t))$. Note that in a
tuple-independent database a weight, $w$, represents the odds,
$w = p/(1 - p)$. In other words, weight values of $0, 1, \infty$
correspond to probabilities $0, 1/2, 1$ respectively. From now on,
unless otherwise stated, we will assume in the rest of this paper that
an INDB is given as in Def. 2, that is, we are given the weights of
the tuples, not their probabilities. The tuple's probability can
always be recovered as $w/(1 + w)$.
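
The correspondence between weights and probabilities can be written as a pair of conversion helpers. This is a small sketch with hypothetical function names; it treats an infinite weight as probability 1 and does not handle the degenerate weight $-1$, for which $w/(1 + w)$ is undefined.

```python
import math

def weight_to_probability(w):
    """Recover p = w / (1 + w); an infinite weight means probability 1."""
    if math.isinf(w):
        return 1.0
    return w / (1.0 + w)

def probability_to_weight(p):
    """Odds w = p / (1 - p); probability 1 maps to an infinite weight."""
    if p == 1.0:
        return math.inf
    return p / (1.0 - p)

def partition_function(w1, w2):
    """Z for the two-tuple example: the sum of the four world weights."""
    return 1 + w1 + w2 + w1 * w2
```

For example, `weight_to_probability(1.0)` gives 0.5, matching the statement that weights 0, 1, and infinity correspond to probabilities 0, 1/2, and 1.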

2.4 MVDBs 

In this section we introduce Markov Views (MarkoViews) and Markov View
Databases (MVDBs).

Definition 3. A MarkoView is a rule of the form:

    V(x)[w_expr] :- Q                                        (3)

where V is the view name, Q is a UCQ, x are head variables, and w_expr
is an expression representing a non-negative weight.

Let R be a relational schema. An MVDB is a triple
$(Tup, w, \mathcal{V})$, where Tup is a set of possible tuples over
the schema R, $w : Tup \to [0, \infty]$ associates a weight to each
possible tuple, and $\mathcal{V}$ is a set of MarkoViews.

Let $I_{poss}$ denote the deterministic database instance over the
schema R consisting of all possible tuples Tup (forgetting their
probabilities). For each MarkoView $V$, denote by $Tup_V$ the result
of evaluating $V$ on $I_{poss}$, and let
$Tup_{\mathcal{V}} = \bigcup_V Tup_V$: this is the set of all possible
tuples in all views. For each $t \in Tup_V$, let $w_V(t)$ denote its
weight, as computed by the view $V$.

The semantics of an MVDB is a restricted MLN having one feature
$(F_t, w_t)$ for each tuple $t \in Tup \cup Tup_{\mathcal{V}}$,
defined as follows. For each possible tuple $t \in Tup$ in the
probabilistic database we associate the feature where the formula is
$F_t = t$, the grounded atom represented by the tuple $t$, and the
weight is $w_t = w(t)$. For each possible tuple $t \in Tup_V$ in a
view $V$, consider the view definition V(x)[w_expr] :- Q; we associate
$t$ with the feature where the formula is $F_t = Q(t)$, the Boolean
query obtained by substituting the head variables $x$ with the tuple
$t$, and the weight is $w_V(t)$. Thus, the feature $F_t$ is a ground
tuple in the first case, and a Boolean UCQ in the second case. In both
cases, $F_t$ has no free variables, and therefore it is already
"grounded" according to the terminology of MLNs.

Definition 4. Let $(Tup, w, \mathcal{V})$ be an MVDB. Its semantics is
given by the probabilistic database $D_L$ (Def. 1) associated to the
MLN $L = \{(F_t, w_t) \mid t \in Tup \cup Tup_{\mathcal{V}}\}$.

We denote by $P(Q)$ the probability of a Boolean query $Q$ on the
probabilistic database associated to the MVDB. The problem in this
paper is to compute $P(Q)$.



2.5 Discussion and Examples 

An MVDB generalizes tuple-independent databases. Any INDB, $(Tup, w)$,
is in particular an MVDB, $(Tup, w, \emptyset)$, without any
MarkoViews. But MVDBs are much more powerful than INDBs, because they
can impose correlations between arbitrary sets of tuples. We
illustrate with several examples.

Example 1. Consider the MVDB with two possible tuples,
$Tup = \{R(a), S(a)\}$, with weights $w_1, w_2$ respectively, and a
single MarkoView:

    V(x)[w] :- R(x), S(x)

Here $w$ is a constant. Intuitively, the view asserts that the tuples
in R and S are correlated, by some weight $w$. We show now the
associated MLN, $L$. There is a single tuple in the MarkoView,
$Tup_V = \{V(a)\}$, and therefore the MLN has three features:

$$L = \{(R(a), w_1), (S(a), w_2), (R(a) \land S(a), w)\}$$

The probabilistic database $D_L$ has four possible worlds,
$\emptyset$, $\{R(a)\}$, $\{S(a)\}$, $\{R(a), S(a)\}$, with weights:

$$1 \qquad w_1 \qquad w_2 \qquad w w_1 w_2$$

Therefore, the two tuples $R(a), S(a)$ are correlated. When $w = 0$,
then $R(a)$ and $S(a)$ are exclusive events; when $w = 1$ then they
are independent events; when $w = \infty$ then both are certain
tuples. More generally, when $w < 1$ then $R(a), S(a)$ are negatively
correlated, and when $w > 1$ they are positively correlated.
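
The sign of the correlation in Example 1 can be checked numerically from the four world weights. The sketch below computes the marginals and the covariance of the two tuple events; the function names and the sample weights are illustrative. Expanding the expression shows that the covariance equals $w_1 w_2 (w - 1)/Z^2$, so its sign follows $w - 1$.

```python
def example1_marginals(w1, w2, w):
    """Return P(R(a)), P(S(a)) and P(R(a) and S(a)) for Example 1."""
    z = 1 + w1 + w2 + w * w1 * w2          # partition function
    p_r = (w1 + w * w1 * w2) / z           # worlds {R(a)} and {R(a), S(a)}
    p_s = (w2 + w * w1 * w2) / z           # worlds {S(a)} and {R(a), S(a)}
    p_rs = (w * w1 * w2) / z               # world {R(a), S(a)}
    return p_r, p_s, p_rs

def covariance(w1, w2, w):
    """Cov(X_R, X_S) = P(R and S) - P(R) * P(S)."""
    p_r, p_s, p_rs = example1_marginals(w1, w2, w)
    return p_rs - p_r * p_s
```

With `w1 = 2` and `w2 = 3`, the covariance is negative for `w = 0.5`, vanishes (up to rounding) for `w = 1`, and is positive for `w = 4`; for `w = 0` the joint probability is 0, so the two tuples are exclusive.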

Example 2. A more complex example is V(x)[w] :- R(x), S(x, y). Each
tuple $t = V(a)$ in the view defines the MLN feature
$F_t = \exists y.R(a), S(a, y)$. The lineage of this Boolean query is
$(R(a) \land S(a, b_1)) \lor (R(a) \land S(a, b_2)) \lor \ldots$ and
the MarkoView introduces a correlation between all tuples in the
lineage expression. Here the MarkoView introduces a correlation
between a large number of tuples, in turn forming a large clique in
the associated Markov Network. The view V1 in Fig. 1 is of this type,
because the year attribute of Student^p is projected out.
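
The lineage in Example 2 can be computed directly from the possible tuples. The sketch below represents the lineage as a list of conjuncts in disjunctive normal form; the tuple encoding and the function names are assumptions made for illustration.

```python
def lineage(a, r_tuples, s_tuples):
    """Lineage of the Boolean query 'exists y. R(a), S(a, y)' as a DNF.

    Each conjunct is a pair (R(a), S(a, b)) of possible tuples.
    """
    r = ("R", a)
    if r not in r_tuples:
        return []
    return [(r, s) for s in sorted(s_tuples) if s[1] == a]

def correlated_tuples(a, r_tuples, s_tuples):
    """All possible tuples mentioned in the lineage of V(a)."""
    return {t for conjunct in lineage(a, r_tuples, s_tuples) for t in conjunct}

# Illustrative possible tuples: S(c, b1) does not join with R(a).
R = {("R", "a")}
S = {("S", "a", "b1"), ("S", "a", "b2"), ("S", "c", "b1")}
clique = correlated_tuples("a", R, S)
```

Here the feature for V(a) ties together R(a), S(a, b1) and S(a, b2); with many matching S-tuples the set grows accordingly, which is the large clique mentioned above.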

Example 3. An example of a large MVDB is given in Fig. 1. The set of
possible tuples Tup is defined by the deterministic tables at the top,
and by the three queries in the middle of Fig. 1 (which also define
the weight function $w$ for the probabilistic tables). The MarkoViews
are V1, V2, V3 at the bottom of Fig. 1. The MVDB has over 6M
probabilistic tuples (correlated) and over 6M deterministic tuples.

Thus, MarkoViews allow us to express both positive and negative
correlations between probabilistic tuples. They are, however, strictly
less expressive than MLNs, because they only allow UCQs as features.
The advantage of imposing such a restriction is that it allows us to
translate query evaluation of UCQs on MVDBs to query evaluation of
UCQs on tuple-independent databases, as we show in the next section.
For example, MarkoViews cannot express a feature like "transitively
closed", which would be written in MLNs as
$((R(x, y), R(y, z) \Rightarrow R(x, z)), w)$. One can express this in
MarkoViews if we extend them to allow negations:

    V(x, y, z)[1.0/w] :- R(x, y), R(y, z), not R(x, z)

While the MLN rewards every grounding of
$R(x, y), R(y, z) \Rightarrow R(x, z)$ by a factor $w$, the MarkoView
penalizes every violation by a factor $1/w$: the two features are
equivalent.
The technique that we describe in the next section applies to this
view too, i.e. queries can still be translated to tuple-independent
databases; the problem is that the new query contains negation, which
is less well understood in probabilistic databases. For that reason,
we are currently restricting MarkoViews to be UCQs, without negation.
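
The claimed equivalence between rewarding satisfied groundings and penalizing violations can be verified by brute force on a tiny domain. The sketch below assumes a two-constant domain and base tuples of weight 1 in both models; these choices, and all names, are made only for the example. It uses exact rational arithmetic so that the two normalized distributions can be compared for equality.

```python
from fractions import Fraction
from itertools import combinations, product

DOMAIN = ("a", "b")
TUPLES = [(x, y) for x in DOMAIN for y in DOMAIN]   # possible R-tuples
GROUNDINGS = list(product(DOMAIN, repeat=3))        # all (x, y, z) triples

def worlds():
    """Yield every set of R-tuples as a frozenset."""
    for size in range(len(TUPLES) + 1):
        for subset in combinations(TUPLES, size):
            yield frozenset(subset)

def violated(world, x, y, z):
    """True when R(x, y) and R(y, z) hold in the world but R(x, z) does not."""
    return (x, y) in world and (y, z) in world and (x, z) not in world

def normalize(weights):
    z = sum(weights.values())
    return {world: phi / z for world, phi in weights.items()}

def mln_distribution(w):
    """Every satisfied grounding of the implication contributes a factor w."""
    return normalize({
        world: w ** sum(not violated(world, *g) for g in GROUNDINGS)
        for world in worlds()
    })

def markoview_distribution(w):
    """Every violation (a tuple of the view with negation) contributes 1/w."""
    return normalize({
        world: (1 / w) ** sum(violated(world, *g) for g in GROUNDINGS)
        for world in worlds()
    })
```

The unnormalized weights differ by the constant factor $w^8$ (there are eight groundings), which cancels after normalization.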

3. TRANSLATING MVDB TO INDB 
3.1 Main Idea 

Consider two possible tuples $R(a), S(a)$ with weights $w_1, w_2$ and
the MarkoView V(x)[w] :- R(x), S(x), where $w$ is a constant. The four
possible worlds have weights:

$$1 \qquad w_1 \qquad w_2 \qquad w w_1 w_2$$

We show how to reduce the probability $P(Q)$ to the probability of a
query on a tuple-independent database. Any Boolean query $Q$
corresponds to a subset of worlds; there are $2^4 = 16$ inequivalent
Boolean queries. Then $P(Q) = \Phi(Q)/Z$, where $\Phi(Q)$ is the sum
of the weights corresponding to the worlds that satisfy $Q$. For a
very concrete example, if $Q = R(a) \lor S(a)$ then
$\Phi(Q) = w_1 + w_2 + w w_1 w_2$, and

$$P(Q) = (w_1 + w_2 + w w_1 w_2) / (1 + w_1 + w_2 + w w_1 w_2).$$

Consider now a tuple-independent database over three relations, R, S,
NV, with three possible tuples $R(a), S(a), NV(a)$; their weights are
$w_1, w_2, w_0$, where $w_1, w_2$ are as above and $w_0$ will be
determined shortly. Consider the hard constraint $\lnot W$, where
$W = R(a) \land S(a) \land NV(a)$. If one defines
$V(a) = \lnot NV(a)$, then
$\lnot W \equiv (R(a), S(a) \Rightarrow V(a))$. Seven out of the eight
possible worlds satisfy $\lnot W$, and their weights are:

  ¬NV(a)    1          w_1              w_2              w_1 w_2
  NV(a)     w_0        w_0 w_1          w_0 w_2
  Total:    1 + w_0    (1 + w_0) w_1    (1 + w_0) w_2    w_1 w_2

We write $\Phi_0$, $Z_0$ ($= \Phi_0(true)$) and $P_0$ for the weight
function, the partition function, and the probability defined by this
tuple-independent database. Suppose we want to compute $P(Q)$ for some
query $Q$ over the schema R, S, and consider its weight
$\Phi_0(Q \land \lnot W)$ in the new database: each column in the
table above is either entirely included in the sum or is not included
at all, because $Q$ does not refer to NV, only to R and S. Thus,
$\Phi_0(Q \land \lnot W)$ is a sum of a subset of the weights in the
last row, labeled "Total". Set $w_0$ such that $w = 1/(1 + w_0)$: then
(footnote 4) $\Phi_0(Q \land \lnot W) = (1 + w_0) \cdot \Phi(Q)$. For
example, if $Q = R(a) \lor S(a)$ then:

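$$\begin{aligned}
\Phi_0(Q \land \lnot W) &= (1 + w_0) w_1 + (1 + w_0) w_2 + w_1 w_2 \\
 &= (1 + w_0) \cdot \left(w_1 + w_2 + \frac{1}{1 + w_0}\, w_1 w_2\right) \\
 &= (1 + w_0) \cdot \Phi(Q)
\end{aligned}$$

Therefore:

$$P(Q) = \frac{\Phi(Q)}{\Phi(true)}
       = \frac{\Phi_0(Q \land \lnot W)}{\Phi_0(\lnot W)}
       = \frac{P_0(Q \land \lnot W)}{P_0(\lnot W)}
       = \frac{P_0(Q \lor W) - P_0(W)}{1 - P_0(W)}$$
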



4 $\Phi(Q)$ is a sum of a subset of the weights
$1, w_1, w_2, w w_1 w_2$, while $\Phi_0(Q \land \lnot W)$ is the sum
of the corresponding subset of the weights
$(1 + w_0), (1 + w_0) w_1, (1 + w_0) w_2, w_1 w_2$.



In summary, we have reduced the problem of evaluating $P(Q)$ on an MVDB
to the problem of evaluating the probabilities $P_0(Q \vee W)$ and
$P_0(W)$ on a tuple-independent database. Note that we can also express
it as a conditional probability, $P(Q) = P_0(Q \mid \neg W)$; we prefer to
use the expression above because both probability expressions
$P_0(Q \vee W)$ and $P_0(W)$ are for UCQs, for which query evaluation on
tuple-independent databases is very well understood.
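
The reduction can be checked numerically on the two-tuple example. The following Python sketch is illustrative only: the function names and the brute-force enumeration of possible worlds are assumptions introduced here, not part of the system described in this paper. It computes $P(Q)$ once directly from the MVDB weights and once through $(P_0(Q \vee W) - P_0(W)) / (1 - P_0(W))$.

```python
from itertools import product

def mvdb_prob(w1, w2, w, query):
    """P(Q) in the MVDB with tuples R(a), S(a) and a view V(x) :- R(x), S(x) of weight w."""
    num = den = 0.0
    for r, s in product([0, 1], repeat=2):
        # World weight: tuple weights times the view weight when R(a) and S(a) both hold.
        weight = (w1 if r else 1.0) * (w2 if s else 1.0) * (w if (r and s) else 1.0)
        den += weight
        if query(r, s):
            num += weight
    return num / den

def indb_prob(w1, w2, w0, event):
    """P0(event) in the tuple-independent database over R(a), S(a), NV(a)."""
    num = den = 0.0
    for r, s, nv in product([0, 1], repeat=3):
        weight = (w1 if r else 1.0) * (w2 if s else 1.0) * (w0 if nv else 1.0)
        den += weight
        if event(r, s, nv):
            num += weight
    return num / den

def translated_prob(w1, w2, w, query):
    """P(Q) obtained via (P0(Q or W) - P0(W)) / (1 - P0(W)) with w0 = (1 - w) / w."""
    w0 = (1.0 - w) / w
    in_w = lambda r, s, nv: bool(r and s and nv)  # W = R(a) and S(a) and NV(a)
    p_q_or_w = indb_prob(w1, w2, w0, lambda r, s, nv: query(r, s) or in_w(r, s, nv))
    p_w = indb_prob(w1, w2, w0, in_w)
    return (p_q_or_w - p_w) / (1.0 - p_w)
```

Calling both functions with `query = lambda r, s: bool(r or s)` and the same weights should give the same value up to rounding. The enumeration never assumes that weights are non-negative, so the check can also be run with $w > 1$, where $w_0$ is negative.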

3.2 Main Theorem 

Definition 5. Consider an MVDB $D = (Tup, w, V)$ over the relational
schema $R$. Let $Tup_V$ be the set of all possible tuples in all views,
and $w_V : Tup_V \rightarrow [0, \infty]$ be their weights (Sect. 2.4).

Let $NV$ denote the relational schema having one relation symbol $NV_i$
for each MarkoView $V_i$.

The tuple-independent database associated to $D$ is the following
database over the schema $R \cup NV$: $D_0 = (Tup_0, w_0)$, where the set
of possible tuples and the weight function are defined by:

$$Tup_0 = Tup \cup Tup_{NV}$$
$$Tup_{NV} = \{ NV_i(a) \mid V_i(a) \in Tup_V \}$$
$$w_0(t) = \begin{cases}
  w(t) & \text{if } t \in Tup, \\
  \dfrac{1 - w_V(t)}{w_V(t)} & \text{if } t \in Tup_V
\end{cases}$$



In other words, to compute the INDB from the MVDB one proceeds as
follows. All deterministic or probabilistic tables in the MVDB become
independent tables in the INDB, with the same weights. (A deterministic
table in the MVDB has all weights $= \infty$, hence it remains a
deterministic table in the INDB.) In addition, create a new relation
$NV_i$ for each MarkoView $V_i$: the possible tuples are all the possible
tuples in the view $V_i$, and their weights are $w_0 = (1 - w)/w$, where
$w$ is the weight defined by the MarkoView for that tuple. (Note that
$w = 1/(1 + w_0)$, a fact we will use in the proof.) The next theorem is
the main theoretical result in this paper, and key to our technique. It
says that, in order to compute UCQ queries on the MVDB, it suffices to
compute UCQ queries over the associated INDB:

Theorem 1. Let $V = \{V_1, \ldots, V_m\}$ be the MarkoViews in the MVDB.
For $i = 1, \ldots, m$ let $Q_i$ be the UCQ defining the view $V_i$.
Denote by $W_i$ the Boolean query

$$W_i = \exists x_i.\, NV_i(x_i) \wedge Q_i(x_i) \qquad (4)$$

Further define $W = \bigvee_i W_i$ (this is also a Boolean UCQ query).
Then, for every Boolean query $Q$, the following holds:

$$P(Q) = \frac{P_0(Q \vee W) - P_0(W)}{1 - P_0(W)} \qquad (5)$$

Proof. We follow the same steps as in Sect. 3.1. We start by computing
$\Phi(Q)$ according to Def. 4:

$$\Phi(Q) = \sum_{J \subseteq Tup:\, J \models Q} \Phi(J)
          = \sum_{J \subseteq Tup:\, J \models Q} \Phi_0(J) \cdot
            \prod_{t \in Tup_V:\, J \models F_t} w_V(t)$$

Recall that the MLN associated to the MVDB has two sets of features
$F_t$: for $t \in Tup$ and for $t \in Tup_V$. Thus, $\Phi(J)$ has two
parts: $\prod_{t \in J} w(t)$, which is the same as $\Phi_0(J)$, the
weight of $J$ in the INDB; and the product of all the weights of the view
features satisfied by the world $J$. This justifies the expression for
$\Phi(Q)$. Next, we compute $\Phi_0(Q \wedge \neg W)$:

$$\Phi_0(Q \wedge \neg W) = \sum_{I:\, I \models Q \wedge \neg W} \Phi_0(I) \qquad (6)$$

The possible world $I$ ranges over all subsets of $Tup \cup Tup_{NV}$,
and, hence, it can be written as $I = J \cup K$, where $J \subseteq Tup$
and $K \subseteq Tup_{NV}$. Fix the $J$ component of $I$. Fix a MarkoView
$(V_i, w, Q_i(x_i))$ and a possible tuple in this MarkoView,
$t = V_i(a) \in Tup_{V_i}$. There are two cases.
Case 1: $a \notin Q_i(J)$; equivalently, $J \not\models F_t$. In that
case, we can satisfy the constraint $Q_i(x_i) \Rightarrow \neg NV_i(x_i)$
by either including or not including the tuple $NV_i(a)$ in $K$: the sum
of the two weights is $1 + w_0(t)$, where $w_0(t)$ denotes the weight of
$NV_i(a)$. Case 2: $a \in Q_i(J)$; equivalently, $J \models F_t$. In that
case, in order to satisfy the constraint we must exclude the tuple
$NV_i(a)$ from $K$, hence its multiplicative contribution is 1. Thus, we
rewrite Eq. 6 grouping by $J$:

$$\begin{aligned}
\Phi_0(Q \wedge \neg W)
  &= \sum_{J, K:\, J \cup K \models Q \wedge \neg W} \Phi_0(J)\, \Phi_0(K)
   = \sum_{J:\, J \models Q} \Phi_0(J) \cdot \prod_{t:\, J \not\models F_t} (1 + w_0(t)) \\
  &= \sum_{J:\, J \models Q} \Phi_0(J) \cdot \prod_{t} (1 + w_0(t)) \cdot
     \prod_{t:\, J \models F_t} \frac{1}{1 + w_0(t)}
   = P \cdot \Phi(Q)
\end{aligned}$$

where $P = \prod_t (1 + w_0(t))$ and $t$ ranges over $Tup_V$. In the last
line we used the fact that $1/(1 + w_0(t)) = w_V(t)$. The theorem follows
now by noting that $P(Q) = \Phi(Q)/\Phi(true)$, and repeating the
argument at the end of Sect. 3.1. □

We end this section with three observations. First, we note that the
lineage of $Q \vee W$ is no larger than the lineages of $Q$ and $W$
combined. In fact, the lineage of $Q \vee W$ is precisely the disjunction
of the two lineage expressions of $Q$ and $W$ respectively. Thus, our
translation does not add any complexity to the probabilistic inference
problem. Second, we note that the translation gives us an immediate tool
for identifying tractable cases. The UCQ queries that can be evaluated in
PTIME over INDBs are fully characterized in [8], and are called safe
queries. An immediate corollary of Theorem 1 is that query evaluation
over an MVDB is tractable if both $Q \vee W$ and $W$ are safe. Other
tractability results on INDBs also carry over immediately to MVDBs, for
example the query compilation results in [15]. Finally, a comment on
denial views, which are MarkoViews where the weight is 0: for example,
view $V_2$ in Fig. 1 is a denial view. In that case $NV$ is a
deterministic table, since its weight is $(1 - 0)/0 = \infty$, and can be
dropped entirely from the definition of $W_i$, simplifying Eq. 4.

3.3 Discussion: Negative Probabilities 

Some of the probabilities in $D_0$ may be negative: if $w > 1$, then
$w_0 = (1 - w)/w < 0$, and the probability $p_0 = w_0/(1 + w_0) = 1 - w$
is negative. This may raise questions about the soundness of our
approach. However, negative probabilities have been considered before; it
has been proven that probability theory can be consistently extended to
allow for negative probabilities [2], and there is interest in applying
them to quantum mechanics [6] and financial modeling [13]. In our
setting, the negative probabilities have a much more benign role: they
are simply numbers that need to be plugged into the right hand side of
Eq. 5 to make the equality hold. Every query answer $P(Q)$ will be a
correct probability, in $[0, 1]$, even if the probabilities $P_0$ on the
right are negative. All familiar equalities still hold for $P_0$: for
example, the rule for negation $P_0(\neg Q) = 1 - P_0(Q)$ and
inclusion/exclusion
$P_0(Q_1 \vee Q_2) = P_0(Q_1) + P_0(Q_2) - P_0(Q_1 \wedge Q_2)$ still
hold; similarly, if $Q_1, Q_2$ are independent queries (their lineages
have no common variables) then the laws of independence continue to hold:
$P_0(Q_1 \wedge Q_2) = P_0(Q_1) P_0(Q_2)$ and
$P_0(Q_1 \vee Q_2) = 1 - (1 - P_0(Q_1))(1 - P_0(Q_2))$; similarly,
Shannon's expansion formula holds. In fact all exact inference methods
(Davis-Putnam procedure, tree-width based methods, OBDD constructions)
work without any modification on probability spaces with negative
probabilities. We take advantage of this fact in the next section.
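
As a small sanity check of these identities, the Python sketch below sums a signed product measure over all assignments of independent Boolean variables. The helper names are hypothetical and the enumeration is exponential in the number of variables; it is meant only to illustrate that negation, inclusion/exclusion and the independence rule do not rely on the probabilities being non-negative.

```python
from itertools import product

def signed_prob(probs, event):
    """Sum the (possibly negative) measure of all assignments satisfying `event`.

    probs[i] is P0(X_i = true); event maps a tuple of booleans to a boolean.
    """
    total = 0.0
    for world in product([False, True], repeat=len(probs)):
        measure = 1.0
        for p, bit in zip(probs, world):
            measure *= p if bit else 1.0 - p
        if event(world):
            total += measure
    return total

def tuple_prob_from_view_weight(w):
    """p0 = w0 / (1 + w0) with w0 = (1 - w) / w; algebraically this equals 1 - w."""
    w0 = (1.0 - w) / w
    return w0 / (1.0 + w0)
```

For instance, a view weight of 1.5 yields a tuple probability of about -0.5, and `signed_prob` can then be used to compare both sides of each identity on a two-variable space.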

Approximate methods, however, no longer work out-of-the-box. For example,
the inequality $P_0(Q_1 \vee Q_2) \leq P_0(Q_1) + P_0(Q_2)$ must be
replaced with the weaker
$|P_0(Q_1 \vee Q_2)| \leq |P_0(Q_1)| + |P_0(Q_2)|$; another issue is that
approximation methods may no longer return final values in $[0, 1]$; it
is also unclear how to run sampling-based methods. In this paper we do
not consider any approximation or sampling methods, and instead restrict
the discussion to exact probability computation. We show next how to do
this quite effectively.

4. COMPILING MARKO VIEWS 

We have translated the problem $P(Q)$ on an MVDB into the problem
$P_0(Q \vee W)$ on a tuple-independent database. Thus, from now on we
will only consider tuple-independent databases.

Even though query evaluation over INDBs has been well studied, there is
an important distinction in our setting. While the lineage of $Q$ is
typically small, the lineage of $W$ is usually large, especially as we
try to capture more correlations. For example, the DBLP data in Fig. 1
has over 6M probabilistic tuples, and many of them (if not all) will be
included in the lineage of $W$. Since $W$ is defined offline, we employ
the strategy of pre-compiling $W$, in order to speed up query evaluation
of $P_0(Q \vee W)$ online. In this section, we discuss the data structure
MV-Index into which we compile $W$, along with the algorithm for
compilation. The MV-Index has been designed to speed up the online
evaluation of $P_0(Q \vee W)$, which we discuss next.

Throughout this section we assume a tuple-independent probabilistic
database $D$. We associate to each tuple a Boolean variable
$X_1, X_2, \ldots, X_n$ (Sect. 2.1). We consider only Boolean queries in
this section. $W$ is already a Boolean query (Eq. 4); the user query $Q$
is typically not a Boolean query, but for the purpose of query evaluation
we substitute its head variables with an answer tuple, thus transforming
it into a Boolean query. We denote by $\Phi_Q$ the lineage of the query
$Q$ on the probabilistic database. $\Phi_Q$ is a Boolean formula using
variables $X_1, \ldots, X_n$ (see, e.g., [33]). Figure 3 shows a
probabilistic database, a query $Q$, and its lineage expression
$\Phi_Q$. Notice that the lineage of a disjunction is the disjunction of
the lineages, $\Phi_{Q \vee W} = \Phi_Q \vee \Phi_W$.

4.1 MV Index 

An MV-Index consists of a set of OBDDs augmented with certain
pre-computations and indices that we describe below. But first we briefly
review OBDDs.

OBDD. An Ordered Binary Decision Diagram (OBDD) [5] is a rooted DAG,
where internal nodes are labeled with Boolean variables and have two
outgoing edges, labeled 0 and 1; sink nodes (leaves) are labeled 0 or 1.
There are two constraints: every path from the root to a leaf must visit
each variable at most once, and any two paths must visit the variables in
the same order (missing variables are OK); see an example in Figure 3.
Given an OBDD for $\Phi$ one can compute the probability $P_0(\Phi)$ in
linear time. Denote by $p(u)$ the probability of the Boolean formula
encoded by the OBDD rooted at node $u$. If $u$ is a leaf node, set
$p(u) = 0$ or $p(u) = 1$ (depending on its label); otherwise it is
labeled with some variable, say $X_i$, and has two children, say
$u_0, u_1$, corresponding to the outgoing edges labeled with 0 and 1
respectively. We set
$p(u) = (1 - P_0(X_i)) \cdot p(u_0) + P_0(X_i) \cdot p(u_1)$.
This formula (Shannon expansion) also holds when $P_0(X_i)$ is negative.
The size of an OBDD is the total number of nodes in it. The width at
level $i$ is the number of nodes labeled with variable $X_i$, and the
width is the maximum width at any level. Note that the size of an OBDD is
at most the width times the number of variables.

    R:  A    var        S:  A    B    var
        a_1  X_1            a_1  b_1  Y_1
        a_2  X_2            a_1  b_2  Y_2
                            a_2  b_3  Y_3
                            a_2  b_4  Y_4

    Q = R(x), S(x,y)
    $\Phi_Q = X_1 Y_1 \vee X_1 Y_2 \vee X_2 Y_3 \vee X_2 Y_4$

Figure 3: A query $Q$ on a probabilistic database, its lineage $\Phi_Q$,
and an OBDD for $\Phi_Q$
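
The bottom-up computation of $p(u)$ can be sketched in a few lines of Python. The node encoding below (a dictionary from node id to a triple of variable, 0-child and 1-child, with `True`/`False` as sinks) and the name `obdd_probability` are assumptions made for this illustration; `FIG3` is our own reconstruction of an OBDD for the lineage of Figure 3 under the order $X_1, Y_1, Y_2, X_2, Y_3, Y_4$.

```python
# Assumed encoding of an OBDD for X1Y1 v X1Y2 v X2Y3 v X2Y4.
# Each node maps to (variable, 0-child, 1-child); True and False are the sinks.
FIG3 = {
    "X1": ("X1", "X2", "Y1"),
    "Y1": ("Y1", "Y2", True),
    "Y2": ("Y2", "X2", True),
    "X2": ("X2", False, "Y3"),
    "Y3": ("Y3", "Y4", True),
    "Y4": ("Y4", False, True),
}

def obdd_probability(nodes, root, prob):
    """Compute p(root) by Shannon expansion, visiting each node once."""
    memo = {}

    def p(u):
        if u is True:
            return 1.0
        if u is False:
            return 0.0
        if u not in memo:
            var, lo, hi = nodes[u]
            # p(u) = (1 - P0(X)) * p(u0) + P0(X) * p(u1); P0(X) may be negative.
            memo[u] = (1.0 - prob[var]) * p(lo) + prob[var] * p(hi)
        return memo[u]

    return p(root)
```

Memoization is what keeps the running time linear in the number of nodes: a node shared by several parents, such as `"X2"` above, is evaluated only once.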

The MV-Index. An augmented OBDD for $\Phi$ is an OBDD where each node $u$
is annotated with two quantities. First, u.probUnder stores $p(u)$.
Second, u.reachability stores the sum of the probabilities of all paths
from the root to $u$. The probability of a path is defined as the product
of the following factors: for an edge $X_i = 1$ we include the factor
$P_0(X_i)$, and similarly for an edge $X_i = 0$ we include
$(1 - P_0(X_i))$. To see the intuition, suppose we want to compute
$P_0(\varphi \wedge \Phi)$ for some "small" expression $\varphi$, and
assume $\varphi = X_1$, the root variable of the OBDD. Then we just
return p * u.probUnder, where $p$ is the probability of $X_1$ and $u$ is
the 1-child of the root. Assume now $\varphi = X_i$, and that there are
$c$ nodes in the OBDD labeled $X_i$: $u_1, u_2, \ldots, u_c$. Assume
every path from the root to a sink contains one of these nodes. Let
$v_1, v_2, \ldots, v_c$ be their 1-children. Then
$P_0(X_i \wedge \Phi) = p \cdot \sum_{j=1}^{c} u_j.\text{reachability} \cdot v_j.\text{probUnder}$.
When the width $c$ is small, the augmented OBDD allows us to compute
these probabilities efficiently.
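
The two annotations and the formula for $P_0(X_i \wedge \Phi)$ can be sketched as follows. This is a minimal Python illustration under assumptions of our own: the dictionary encoding of nodes, the function names, and the linear scan that stands in for the node lookup are hypothetical, and the example OBDD is deliberately non-reduced so that every root-to-sink path tests the variable C.

```python
def annotate(nodes, root, order, prob):
    """Return (prob_under, reach) dictionaries for an OBDD.

    nodes: node id -> (variable, 0-child, 1-child); True/False are the sinks.
    order: list of variables in OBDD order; prob: variable -> P0(variable).
    """
    rank = {v: i for i, v in enumerate(order)}
    by_level = sorted(nodes, key=lambda u: rank[nodes[u][0]])

    # probUnder: children sit at deeper levels, so go from the last level up.
    prob_under = {True: 1.0, False: 0.0}
    for u in reversed(by_level):
        var, lo, hi = nodes[u]
        prob_under[u] = (1.0 - prob[var]) * prob_under[lo] + prob[var] * prob_under[hi]

    # reachability: parents sit at shallower levels, so go from the root down.
    reach = {u: 0.0 for u in nodes}
    reach[True] = reach[False] = 0.0
    reach[root] = 1.0
    for u in by_level:
        var, lo, hi = nodes[u]
        reach[lo] += reach[u] * (1.0 - prob[var])
        reach[hi] += reach[u] * prob[var]
    return prob_under, reach

def prob_var_and_formula(nodes, var, prob, prob_under, reach):
    """p * sum_j u_j.reachability * v_j.probUnder over the nodes u_j labeled var."""
    return prob[var] * sum(reach[u] * prob_under[hi]
                           for u, (v, lo, hi) in nodes.items() if v == var)

# Non-reduced OBDD for (A or B) and C under the order A, B, C.
EXAMPLE = {
    "A": ("A", "B", "C1"),
    "B": ("B", "C0", "C1"),
    "C1": ("C", False, True),
    "C0": ("C", False, False),  # redundant node, kept so every path tests C
}
```

In `EXAMPLE`, the formula is applicable to `"C"` and to the root variable `"A"`. It is not applicable to `"B"`, because the path through the 1-edge of `"A"` bypasses every node labeled `"B"`; this is exactly the situation excluded by the assumption that every path contains one of the nodes $u_j$.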

Finally, an MV-Index consists of a set of augmented OBDDs, each of them
associated with a particular key. These OBDDs are over disjoint sets of
variables. Besides the OBDDs, the MV-Index keeps two indices.
InterBddIndex, given a tuple, returns the key of the OBDD where the tuple
is located. IntraBddIndex returns all the nodes in the OBDD that
correspond to that tuple. For example, in our previous computation of
$P_0(X_i \wedge \Phi)$, we still needed to locate all the nodes labeled
with $X_i$. These indices enable us to do that in constant time.

4.2 Constructing the MV-Index 

In this section, we describe our algorithm to construct an OBDD for a
UCQ. Given an OBDD, adding the pre-computations and indices needed to
make it an MV-Index is straightforward and hence skipped in this section.

Let $\Pi$ be the order in which an OBDD visits the variables on each path
(e.g., $\Pi = X_1, Y_1, Y_2, X_2, Y_3, Y_4$ in Fig. 3). The order $\Pi$
uniquely determines the OBDD [35], up to merging of equivalent nodes;
therefore the problem of finding a small OBDD is equivalent to finding a
good order $\Pi$ (meaning one that leads to a small OBDD). Denote by
$OBDD_\Pi(\Phi)$ the OBDD of $\Phi$ with order $\Pi$. If
$\Phi = \Phi_1 \vee \Phi_2$, or $\Phi = \Phi_1 \wedge \Phi_2$, and we are
given OBDDs $G_1, G_2$ for $\Phi_1, \Phi_2$ with the same order $\Pi$,
then one can compute $OBDD_\Pi(\Phi)$ in time $O(|G_1||G_2|)$ by a
procedure called synthesis. CUDD [32], a widely popular package for
OBDDs, uses this synthesis procedure. It starts with some order $\Pi$ and
synthesizes the OBDD by traversing $\Phi$ recursively.

We make the following improvement over the synthesis procedure in CUDD.
Suppose $\Phi_1, \Phi_2$ are independent, i.e. they use disjoint sets of
Boolean variables; then one can synthesize $\Phi_1 \vee \Phi_2$ more
efficiently: stack the OBDDs $G_1, G_2$ on top of each other, and
redirect every 0-labeled leaf of $G_1$ to the root of $G_2$ (for
$\Phi_1 \wedge \Phi_2$ one would redirect the 1-labeled leaves). We call
this concatenation. The size of the new OBDD is only $|G_1| + |G_2|$,
and, unlike synthesis, concatenation is a constant time operation. Thus,
concatenation represents a major improvement over synthesis. We explain
next where we can use concatenation when computing the OBDD of a UCQ.
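
A minimal Python sketch of concatenation follows, using a hypothetical dictionary encoding (node id mapped to variable, 0-child, 1-child, with `True`/`False` as sinks) and assuming the two OBDDs use disjoint node ids and disjoint variables. Note that this sketch copies the nodes of $G_1$, so it runs in time linear in $|G_1|$; a pointer-based representation with a single shared sink object can achieve the redirection in constant time, which is the setting the text refers to.

```python
def concatenate(g1, root1, g2, root2, op="or"):
    """Stack g1 on top of g2 and return (nodes, root) of the combined OBDD.

    For op == "or" every edge of g1 into the 0-sink is redirected to root2;
    for op == "and" every edge into the 1-sink is redirected instead.
    """
    redirected_sink = False if op == "or" else True
    nodes = dict(g2)
    for u, (var, lo, hi) in g1.items():
        nodes[u] = (var,
                    root2 if lo is redirected_sink else lo,
                    root2 if hi is redirected_sink else hi)
    return nodes, root1

# OBDDs for X1(Y1 v Y2) and X2(Y3 v Y4), the two independent parts of Figure 3.
G1 = {"X1": ("X1", False, "Y1"), "Y1": ("Y1", "Y2", True), "Y2": ("Y2", False, True)}
G2 = {"X2": ("X2", False, "Y3"), "Y3": ("Y3", "Y4", True), "Y4": ("Y4", False, True)}
```

Calling `concatenate(G1, "X1", G2, "X2")` yields an OBDD with six nodes, the sum of the two input sizes, under the order $X_1, Y_1, Y_2, X_2, Y_3, Y_4$.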

Consider a Conjunctive Query (CQ) $Q$. A root variable in $Q$ is a
variable $x$ that appears in all atoms of $Q$. One can write
$Q = Q[a_1/x] \vee Q[a_2/x] \vee \ldots$, where $a_1, a_2, \ldots$ form
the active domain of $x$. It is not hard to see that if $Q$ has no
self-joins then $Q[a_i/x]$ and $Q[a_j/x]$ have no tuples in common; hence
the OBDD of $Q$ can be obtained by concatenating the OBDDs of the
$Q[a_i/x]$. This is illustrated for $Q = R(x), S(x,y)$ in Fig. 3.

In general, for a UCQ $Q = Q_1 \vee Q_2 \vee \ldots$, where the $Q_i$ are
CQs, let $x_i$ be a root variable of $Q_i$. We write
$Q = \exists x_1.Q_1 \vee \exists x_2.Q_2 \vee \ldots
   = \exists z.(Q_1[z/x_1] \vee Q_2[z/x_2] \vee \ldots)$.
The new variable $z$ occurs in all atoms of all conjunctive queries. We
call $z$ a separator variable if any two atoms with the same relation
symbol contain $z$ on the same attribute position. Let $a_1, a_2, \ldots$
be the active domain for $z$. Then once again the same property holds:
$Q = Q[a_1/z] \vee Q[a_2/z] \vee \ldots$, and the queries on the right
are independent, i.e. they do not have any common tuples. For example,
let $Q = R(x_1), S(x_1, y_1) \vee T(x_2), S(x_2, y_2)$. Both $x_1$ and
$x_2$ are root variables, and we write the query as follows (showing
quantifiers explicitly now):

$$Q = \exists z.[\exists y_1.(R(z) \wedge S(z, y_1)) \vee \exists y_2.(T(z) \wedge S(z, y_2))]$$

Thus, $Q = Q[a_1/z] \vee Q[a_2/z] \vee \ldots$, and the OBDD for $Q$ has
a size that is the sum of the sizes of the OBDDs for the $Q[a_i/z]$. In
general:

Proposition 1. If $z$ is a separator in $Q$, then there exists an OBDD
for $Q$ of size at most the sum of the sizes of the OBDDs of $Q[a_i/z]$,
$i = 1, \ldots, n$, where $a_1, a_2, \ldots, a_n$ is the active domain.

Finally, consider $Q = R(x_1), S(x_1, y_1) \vee S(x_2, y_2), T(y_2)$:
this query does not have a separator variable. For such queries we fall
back on general tools like CUDD. Then there are hybrid cases like
$Q = Q_1 \vee Q_2$, $Q_1 = R(x), S(x,y), R(y)$, $Q_2 = S(x,y), T(x)$.
Here only $Q_2$ has a root variable, so we choose $\Pi$ such that we can
compute $OBDD_\Pi(Q_2)$ by concatenation, then use synthesis to compute
$OBDD_\Pi(Q_1)$ using the same order $\Pi$. Having reviewed the concepts
from our prior work, we are now ready to describe the construction, which
is the novel part of this section.

Fix a relational schema $R = \{R_1, \ldots, R_k\}$. Order the relation
names from smaller to larger arities. Let
$\pi = \{\pi_{R_1}, \ldots, \pi_{R_k}\}$ be a set where each $\pi_{R_i}$
is a permutation on the attributes of the relation $R_i$. That is, if
$arity(R_i) = m$ then $\pi_{R_i}$ is any permutation on $1, \ldots, m$.
Consider a database instance $I$ over an ordered active domain
$a_1 < a_2 < \ldots < a_n$. Then $\pi$ defines an order $\Pi$ on all
tuples in $I$, as follows. $\Pi = \Pi_1, \Pi_2, \ldots, \Pi_n$, where
each $\Pi_j$ is the order obtained recursively as follows: for each
relation $R_i$, retain from $I$ only those tuples where the first
attribute in $R_i$ (according to $\pi_{R_i}$) is $a_j$, then project out
that attribute, and compute $\Pi_j$ recursively on the smaller database.
For example, consider the schema $R(A), S(A,B)$, and the permutations
$\pi_R = (A)$ and $\pi_S = (A, B)$. Consider the database instance in
Fig. 3, where the active domain is ordered by
$a_1 < a_2 < b_1 < b_2 < b_3 < b_4$. Then
$\Pi = X_1, Y_1, Y_2, X_2, Y_3, Y_4$.
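
One way to read this recursive definition is sketched below in Python. The data layout (`db`, `perms`, 0-based attribute positions) and the function names are assumptions made for illustration; in particular, the sketch assumes that tuples whose attributes have all been projected out are emitted first, in the arity-based relation order, before the recursion continues on the remaining tuples.

```python
def tuple_order(db, perms, domain):
    """Return the variables of all tuples in the order Pi induced by perms.

    db:     relation name -> list of (values, variable) pairs
    perms:  relation name -> tuple of 0-based attribute positions, in pi order
    domain: active domain, listed in increasing order
    """
    relations = sorted(db, key=lambda r: len(perms[r]))  # smaller arities first
    entries = [(tuple(values[p] for p in perms[r]), var)
               for r in relations for values, var in db[r]]
    return _order(entries, domain)

def _order(entries, domain):
    # Tuples with no attributes left are placed first, in relation order.
    out = [var for values, var in entries if not values]
    rest = [(values, var) for values, var in entries if values]
    if rest:
        for a in domain:
            # Keep tuples whose first remaining attribute is a, project it out, recurse.
            out += _order([(values[1:], var) for values, var in rest if values[0] == a],
                          domain)
    return out

# The instance of Figure 3.
DB = {
    "R": [(("a1",), "X1"), (("a2",), "X2")],
    "S": [(("a1", "b1"), "Y1"), (("a1", "b2"), "Y2"),
          (("a2", "b3"), "Y3"), (("a2", "b4"), "Y4")],
}
DOMAIN = ["a1", "a2", "b1", "b2", "b3", "b4"]
```

With `perms = {"R": (0,), "S": (0, 1)}` this reproduces the order of the example above. Choosing `(1, 0)` for `"S"` instead sorts the tuples of S by their B attribute first, which separates the R tuples from the S tuples they join with.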

Let $\pi$ be given and let $Q$ be a query. If $x, y$ are two variables
then we write $x \succ_\pi y$ if whenever $y$ occurs in an atom on some
attribute $B$, then $x$ also occurs in that atom on some attribute $A$,
and $A$ comes before $B$ in the permutation $\pi_{R_i}$ of that atom's
relation.

Given $\pi$ and a query $Q$, the following recursive procedure
ConOBDD($\pi$, $Q$) constructs $OBDD_\Pi(Q)$, where $\Pi$ is the
permutation on the tuples associated to $\pi$:

• R1 $Q = Q_1 \vee Q_2$: if $Q_1, Q_2$ have no relations in common,
concatenate, else synthesize.

• R2 $Q = Q_1 \wedge Q_2$: if $Q_1, Q_2$ have no relations in common,
concatenate, else synthesize.

• R3 $Q = \exists x.Q_1$: if $x \succ_\pi y$ for all variables $y$ in
$Q_1$, then concatenate, else synthesize.

• R4 $Q = R(a)$: trivial.
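
The test used in rule R3 can be sketched as follows. This Python fragment is an illustration under assumptions of our own: atoms are given as (relation, argument tuple) pairs whose arguments are all variables, `perms` lists 0-based attribute positions in the order given by $\pi$, and the condition is checked only for variables $y$ different from $x$. The function names are hypothetical.

```python
def precedes(x, y, atoms, perms):
    """True if x >_pi y: wherever y occurs, x occurs in the same atom on an earlier attribute."""
    for rel, args in atoms:
        rank = {attr: k for k, attr in enumerate(perms[rel])}
        for b, term in enumerate(args):
            if term != y:
                continue
            if not any(t == x and rank[a] < rank[b] for a, t in enumerate(args)):
                return False
    return True

def can_concatenate(x, atoms, perms):
    """Condition of rule R3: x >_pi y for every other variable y of the query."""
    others = {t for _, args in atoms for t in args if t != x}
    return all(precedes(x, y, atoms, perms) for y in others)
```

For $Q = R(x), S(x,y)$ the test succeeds when $\pi_S$ lists the first attribute of S before the second, and fails when the two attributes are swapped. For $R(x), S(x,y), R(y)$ it fails under either choice, since $y$ occurs in an atom that does not contain $x$.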

We choose $\pi$ heuristically so as to minimize the number of synthesis
steps in R3. In particular, if $Q$ has a separator variable, then we
always choose $\pi$ such that every attribute holding a separator
variable occurs first in the permutation $\pi$. Call a query $Q$
inversion-free if there exists $\pi$ such that only concatenations are
performed in R3. (This is equivalent to the definition of inversion-free
queries in [15].) Let $\{a_1, \ldots, a_n\}$ be the active domain. The
following proposition, based on [15], gives guarantees on the size of
ConOBDD($\pi$, $Q$) in certain cases.

Proposition 2. Let $N$ be the size of the OBDD returned by
ConOBDD($\pi$, $Q$). (a) If $Q$ admits a separator variable $z$, then
$N = \sum_j N_j$, where $N_j$ is the size of the OBDD of $Q[a_j/z]$,
$j = 1, \ldots, n$. (b) If $Q$ is inversion-free, then the OBDD has
constant width; hence $N = O(n)$ [15].

4.3 Querying an MV-index 

Our goal is to compute Eq. 5 efficiently. Its numerator is
$P_0(Q \vee W) - P_0(W) = P_0(Q \wedge \neg W)$. The OBDD for $\neg W$ is
obtained immediately from the OBDD for $W$, by switching the 0 and 1 sink
nodes. In this section we show how to use an MV-Index for $\neg W$ to
compute $P_0(Q \wedge \neg W)$ efficiently: we call this operation
intersection. Denote $G_W = OBDD_\Pi(\neg W)$. Given $Q$, we first
construct $G_Q = OBDD_\Pi(Q)$. Note that, although $\Pi$ is imposed by
$W$, constructing $G_Q$ is usually quite efficient, because the lineage
of $Q$ is typically small. A naive next step would be to compute
$G_Q \wedge G_W$, but this requires traversing the entire index. We
briefly review our improvements over the naive algorithm next.

MVIntersect. MVIntersect uses $G_Q = OBDD_\Pi(Q)$ to guide a search
through $G_W = OBDD_\Pi(\neg W)$ and sums the probability of only those
worlds that satisfy $Q$, via a top-down algorithm. One of the main
challenges in doing a top-down intersection is maintaining a cache for
memoization. This is well studied (cf. [14] for a top-down algorithm), so
we do not discuss it here. Since $G_W$ and $G_Q$ have the same variable
order, our algorithm just traverses $G_W$ and prunes the branches where
$G_Q$ is false. To do this we could implement a DFS traversal of $G_W$
that maintains on the stack the nodes from both $G_W$ and $G_Q$. Whenever
we pop false from $G_Q$, we do not traverse the subtree below. The stack
can become very big, though, and the node from $G_Q$ is almost always the
same, since $G_Q$ is very small. MVIntersect exploits this by not adding
the same node from $G_Q$ consecutively, but keeping a count of how many
times the same query node has been pushed.
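
The basic top-down intersection with memoization and pruning can be sketched as below. This Python fragment is a plain recursive stand-in, not the stack-based MVIntersect with push counters described above; the node encoding (node id mapped to variable, 0-child, 1-child, with `True`/`False` as sinks) and the name `intersect_prob` are assumptions made for illustration.

```python
def intersect_prob(gq, rq, gw, rw, order, prob):
    """P0(Phi_Q and Phi_notW) for two OBDDs that share the variable order `order`."""
    rank = {v: i for i, v in enumerate(order)}
    memo = {}

    def level(g, u):
        # Sinks are placed below every variable.
        return rank[g[u][0]] if u in g else len(order)

    def go(q, w):
        if q is False or w is False:
            return 0.0  # prune: no world below this point satisfies both formulas
        if q is True and w is True:
            return 1.0
        if (q, w) not in memo:
            lq, lw = level(gq, q), level(gw, w)
            top = min(lq, lw)
            var = order[top]
            # Only the OBDD that tests `var` at this point moves to its children.
            q0, q1 = (gq[q][1], gq[q][2]) if lq == top else (q, q)
            w0, w1 = (gw[w][1], gw[w][2]) if lw == top else (w, w)
            memo[(q, w)] = (1.0 - prob[var]) * go(q0, w0) + prob[var] * go(q1, w1)
        return memo[(q, w)]

    return go(rq, rw)

# Example of Sect. 3.1 with order R, S, NV: Q = R or S, and notW = not(R and S and NV).
GQ = {"qR": ("R", "qS", True), "qS": ("S", False, True)}
GW = {"wR": ("R", True, "wS"), "wS": ("S", True, "wNV"), "wNV": ("NV", True, False)}
```

In an MV-Index, the case where the query node has already reached its 1-sink would not need further recursion, since the remaining value is the stored probUnder of the current index node; the sketch recomputes it to stay self-contained.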

CC-MVIntersect. Since our algorithm runs in main memory, and considering
the increasing gap between the access times of cache and memory, we
optimized our data structure to be more cache-conscious and to minimize
random accesses. A typical BDD data structure, for instance the one used
in CUDD, stores BDD nodes as pointers, and each node contains the
pointers to its neighbors. We improve upon it by keeping the BDD nodes in
a vector sorted according to the DFS traversal of the OBDD. Querying the
OBDD can now be done by a sequential traversal of this vector. The new
BDD nodes, though, may need to store some additional information about
their neighbors. We call this approach CC-MVIntersect.
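
The idea of a vector layout can be illustrated with the following Python sketch. The layout shown here is an assumption of ours, since the text does not spell out the exact node format: nodes are stored in DFS post-order, so that every child precedes its parents and a single forward scan of the vector suffices to evaluate the OBDD. The function names and the sink encoding are hypothetical.

```python
SINK0, SINK1 = -1, -2  # assumed encoding of the two sinks inside the vector

def flatten_postorder(nodes, root):
    """Return a list of (variable, lo_index, hi_index) in DFS post-order."""
    index = {False: SINK0, True: SINK1}
    flat = []

    def visit(u):
        if u in index:
            return
        var, lo, hi = nodes[u]
        visit(lo)
        visit(hi)
        index[u] = len(flat)  # children were numbered before their parent
        flat.append((var, index[lo], index[hi]))

    visit(root)
    return flat

def scan_probability(flat, prob):
    """Evaluate the OBDD with one sequential pass; the root is the last entry."""
    p = []

    def value(i):
        return 0.0 if i == SINK0 else 1.0 if i == SINK1 else p[i]

    for var, lo, hi in flat:
        p.append((1.0 - prob[var]) * value(lo) + prob[var] * value(hi))
    return p[-1]
```

Because child indices are always smaller than the index of the node that refers to them, the scan only reads entries that were written earlier in the same pass.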

Proposition 3. Let $G_W$ be the OBDD for $\neg W$. Let $X_i$ and $X_j$ be
the first and last Boolean variables (according to the permutation $\Pi$)
occurring in the lineage of $Q$, and let $m = j - i + 1$. The
CC-MVIntersect procedure runs in time $O(m \cdot w)$, where $w$ is the
width of $G_W$.

Finally, we would like to point out that if $W$ is inversion-free then,
since its OBDD is of constant width, the runtime of CC-MVIntersect is
linear in the span of the query $Q$, i.e. the distance between the first
and last variable in its lineage.

5. EXPERIMENTAL EVALUATION 

We report here our experimental evaluation of MVDBs, on real data
obtained from [19]. We addressed four questions. How do MarkoViews, and
indexed MarkoViews, compare to other approaches for probabilistic
inference on large Markov Networks? How effective is the MV-index
construction algorithm compared to the standard approach for constructing
OBDDs? How effective is the MV-index-based query evaluation method, and
how significant is the improvement of the CC-MVIntersect algorithm over
MVIntersect? And, finally, how do MVDBs scale to the entire DBLP dataset?

Set Up. For probabilistic inference, we compare our approach with
Alchemy [37], the de facto standard inference engine for MLNs. For OBDD
construction, we have extended CUDD [32], a widely used OBDD package; for
the OBDD experiments we compare native CUDD with our OBDD construction
algorithm. Our implementation is written in C++. We used Postgres 9.0 for
our experiments, which were run on a 2.66 GHz Intel Core 2 Duo with 4GB
RAM, running Mac OS X 10.6. The dataset used is DBLP [19] and we consider
the features from Fig. 1.

5.1 Comparison with Alchemy 

To compare against Alchemy, we construct an MLN using features over
Advisor^p and Student^p, i.e., $V_1, V_2$ in Fig. 1. We did not use
Affiliations^p, because the characters in the affiliation string violated
Alchemy's requirement for constants. MLNs do not allow features to have
parameterized weights as in MVDBs; instead, MLNs require a constant
weight for each feature. One alternative is to materialize each feature
into multiple features, one for each value of the weight; this would have
increased the number of features. We opted instead to use a constant
weight in Alchemy, for simplicity. In the MVDB we continued to use the
weights exactly as described in Fig. 1.

Not surprisingly, Alchemy did not scale to the entire DBLP dataset; we
could scale it only up to aid = 10,000 in Student^p(aid, year), where aid
ranges up to 1M. In our experiments, we generate 10 datasets, with the
domain of aid ranging from 1 to $i \cdot 1000$, $i = 1, \ldots, 10$. The
lineage size of the MarkoViews, i.e. the number of tuples involved in the
constraints (in other words, the size of the lineage of $W$, Sect. 4), is
plotted in Fig. 4. Over each such dataset we ran two queries, of the form
"find the advisor of some student X" and "find all students of some
advisor Y". Fig. 5 and Fig. 6 show the running time comparison of Alchemy
vs. MarkoViews. We show both the total execution time and the time
Alchemy reports it spent in just sampling. It is known that Alchemy is
very inefficient during the grounding phase, and that database techniques
can speed up this phase considerably [20]; therefore the interesting line
in Fig. 5 is the lower line, Alchemy-sampling, which is Alchemy's
reported sampling time, and is a better measure (likely a lower bound) of
the total probabilistic inference time. The sampling method used is
MC-Sat [25].

The results in Fig. 5 and Fig. 6 show that Alchemy's MC-Sat algorithm is
comparable (within a factor of 5, up or down) with evaluating the
MarkoViews directly by constructing the OBDD. Note that OBDDs are an
exact inference method while MC-Sat is an approximate method; also, OBDDs
are not ideal for pure probabilistic inference, but are more appropriate
for compilation. Once we use them for this purpose, by constructing an
MV-index, the performance of the MarkoViews increases dramatically, and
remains mostly constant as we increase the data size.

5.2 Comparison with CUDD 

Here we studied the OBDD construction time of our algorithm, and compared
it to the OBDD construction time of native CUDD. The running time depends
on the size of the resulting OBDD, and therefore we needed a feature that
allowed us to increase the size of the OBDD linearly, and for which the
OBDD constructed by native CUDD was the same as that resulting from our
optimization. The MarkoView $V_2$ in Fig. 1 had both properties. As we
varied linearly the domain of aid1 in Advisor^p(aid1, aid2) from [1000]
to [10000], the size of the resulting OBDD varied linearly, as shown in
Fig. 7. Furthermore, we checked that the sizes of the OBDDs returned by
the two methods were indeed the same. Figure 8 shows that our algorithm
was two orders of magnitude more efficient than CUDD's OBDD construction.
Since aid1 is a separator, our approach exploits concatenation, which is
much more efficient than the standard OBDD synthesis used by CUDD. Since
CUDD is effective at detecting equivalent nodes in the OBDD, it
constructs the same OBDD, but it takes much longer to achieve the same
result. We could not use CUDD to construct an OBDD for the entire DBLP
dataset, even for $V_2$: we estimate it would have taken several hours.

5.3 MV-index-based Query Evaluation 

Recall that our evaluation of $P_0(Q \vee W)$ is based on an optimized
intersection of the OBDD for $Q$ with the MV-index for $W$. Here we
compare two algorithms: MVIntersect and CC-MVIntersect (recall that CC
stands for cache-conscious). We used the same setting as in the previous
section. We used a simple query $Q$ whose lineage consisted of 20 tuples,
chosen as a worst-case scenario: it forced the system to traverse the
entire MV-index, rendering all pre-computations and indices useless.
Thus, query evaluation requires a complete intersection of the two OBDDs.
Figure 9 shows the running time of the two algorithms. As expected, the
running time varies linearly in the size of the MV-index (which, recall
Fig. 7, we designed carefully to be linear in the size of the data),
because the entire OBDD must be traversed. The cache-conscious algorithm
improves by a factor of 2 over the plain algorithm. We note that this is
the worst-case scenario; in typical cases, the query needs to traverse
only a fraction of the MV-index, as will become clear next.

5.4 Scalability to a Large Dataset 

We finally report our scalability results on the entire DBLP dataset, as
described in Fig. 1. The MarkoViews have a separator, hence the MV-index
is obtained by concatenating many small OBDDs; their total size is 1.38M.
Note that not all probabilistic tuples ended up in the MV-index, because
some did not participate in any views. It took under one hour to
construct the OBDD and index.

We evaluated 10 queries, of the form "find all students of an advisor X"
and "find the affiliation of a person Y". The running times are reported
in Fig. 10 and Fig. 11 respectively. In all cases we used CC-MVIntersect.
As one can see, the running times are below 5ms for all queries, and many
are below 1ms. Query evaluation time includes the round-trip call to
Postgres, to compute the query's lineage, then the time to access the
OBDD index, which is a main-memory data structure. Since each query
includes a constant (the name of the advisor X, or the name of the person
Y), only a small portion of the full OBDD had to be accessed at runtime,
which explains the performance of query evaluation. Note that all
probability computations are exact, unlike Alchemy's, which are
approximate.

5.5 Discussion 

We have demonstrated how MV-indexes can be used to dramatically speed up
query evaluation. The key ingredient that makes the index construction
possible is the translation of the query evaluation from a Markov Network
to a tuple-independent database, since OBDDs are only possible over
independent variables. The OBDD construction is non-trivial. However,
once the index is constructed, the performance of query evaluation
becomes comparable to evaluating that query in Postgres.

Figure 10: Querying students of an Advisor

Future work is needed to assess the applicability of MVDBs, both in terms
of modeling power and in terms of performance. Typical applications of
MLNs [26, 31, 27] use 4-10 features, most of which can already be
expressed as MarkoViews, but they use much smaller data sets.

6. RELATED WORK 

Query Evaluation on Tuple-Independent Databases. This problem reduces to
that of evaluating the probability $P(\Phi)$ of a Boolean formula $\Phi$
(over the Boolean variables $X_t$) called the lineage of the query $Q$.
There are two lines of research on query evaluation on INDBs. One aims at
identifying classes of queries for which $P(Q)$ can be computed in
polynomial time in the size of the database: these are called safe
queries. For UCQs without inequality predicates (like $x < 5$ or
$x \neq y$), there exists a syntactic characterization of safe queries
that is complete, i.e. a dichotomy: either $Q$ is safe and then one can
compute $P(Q)$ in PTIME, or $Q$ is unsafe and then computing $P(Q)$ is
#P-hard. The other line of research aims at developing effective
heuristics for computing $P(\Phi)$. In this formulation the problem is
related to model counting, for which several effective heuristics exist.
The most popular is the Davis-Putnam procedure, which is based on Shannon
expansion, see for example [3], and which has been used in probabilistic
databases in [17]; refinements of this procedure also exist [10]. Based
on ideas similar to the Davis-Putnam procedure, a number of techniques
have been described for using OBDDs for query evaluation on probabilistic
databases [21, 22, 15]. Finally, these evaluation methods have been
extended with several approximation techniques [24, 11].

Figure 11: Querying Affiliation

Markov and Bayesian Networks in Probabilistic Databases. Several
proposals exist for extending probabilistic databases to represent Markov
Networks or Bayesian Networks. For example, in [16] the probabilistic
database is defined directly as a Markov Network. The system uses a novel
indexing technique for the junction tree decomposition of the network,
allowing queries on a large database to be evaluated efficiently at
runtime. The method works very well, but only when the tree width of the
Markov Network is small; otherwise the method is intractable. Note that
the Markov Networks in MVDBs have very large cliques, hence very large
tree widths. Tuffy [20] implements MLNs directly in a relational database
system. Its focus is less on developing new algorithms than on leveraging
query processing in the database in order to implement existing MLN
algorithms. Two other projects [36, 34] implement MCMC inside a
relational database for efficiently answering general SQL queries over a
CRF model, commonly used in Information Extraction. The approach taken by
these systems is to scale up existing general-purpose probabilistic
inference methods. Our MVDB approach differs from these in that we do not
optimize any existing algorithm for inference over complex probabilistic
models, but propose a new approach by which we translate from a complex
probabilistic model to a tuple-independent probabilistic model.

7. CONCLUSION 

We described a new approach to probabilistic databases, which allows
complex correlations to be defined between the tuples in a database. Our
new approach is based on MarkoViews, which are a restricted form of
Markov Logic Network features. We made two contributions that allow
queries to be processed very efficiently on such databases. The first is
a translation from MarkoViews into tuple-independent databases. The
second is a compilation of the MarkoViews into OBDDs, which dramatically
speeds up query execution. We have also validated our techniques
experimentally.